Overview

Dataset statistics

Number of variables: 9
Number of observations: 2000
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 730
Duplicate rows (%): 36.5%
Total size in memory: 140.8 KiB
Average record size in memory: 72.1 B
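
The memory and duplicate figures above are consistent with one another. A minimal arithmetic check, using only the numbers in the table and assuming the usual 1 KiB = 1024 B (the variable names are illustrative):

```python
# Values copied from the dataset statistics above.
total_size_kib = 140.8
n_observations = 2000
duplicate_rows = 730

# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_size_kib * 1024 / n_observations
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"average record size: {avg_record_bytes:.1f} B")
print(f"duplicate rows: {duplicate_pct:.1f}%")
```

Both results round to the values the report shows (72.1 B and 36.5%).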

Variable types

Numeric: 8
Categorical: 1

Warnings

Dataset has 730 (36.5%) duplicate rows [Duplicates]
pregnancies is highly correlated with age [High correlation]
age is highly correlated with pregnancies [High correlation]
skin_thickness is highly correlated with insulin [High correlation]
insulin is highly correlated with skin_thickness [High correlation]
skin_thickness is highly correlated with bmi [High correlation]
blood_pressure is highly correlated with bmi [High correlation]
insulin is highly correlated with diabetes_pedigree_function [High correlation]
bmi is highly correlated with skin_thickness and 1 other field [High correlation]
diabetes_pedigree_function is highly correlated with insulin [High correlation]
pregnancies has 301 (15.0%) zeros [Zeros]
blood_pressure has 90 (4.5%) zeros [Zeros]
skin_thickness has 573 (28.6%) zeros [Zeros]
insulin has 956 (47.8%) zeros [Zeros]
bmi has 28 (1.4%) zeros [Zeros]
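
The zero and duplicate warnings can be checked by hand on raw rows. The sketch below uses a small hypothetical dataset (not the real data) and counts a row as a duplicate when it repeats an earlier row; the report does not state its own counting convention, so that choice is an assumption.

```python
from collections import Counter

# Hypothetical mini-dataset: (pregnancies, insulin, bmi) per row.
columns = ("pregnancies", "insulin", "bmi")
rows = [
    (2, 0, 33.6),
    (0, 125, 38.2),
    (2, 0, 33.6),    # exact repeat of the first row
    (1, 480, 0.0),
    (2, 0, 33.6),    # another repeat of the first row
]

# Share of zeros per column, analogous to the "Zeros" warnings.
zero_share = {
    name: sum(1 for row in rows if row[i] == 0) / len(rows)
    for i, name in enumerate(columns)
}

# Assumed convention: every occurrence after the first counts as a duplicate.
repeats = sum(count - 1 for count in Counter(rows).values())

print(zero_share)
print(f"duplicate rows: {repeats} ({100 * repeats / len(rows):.1f}%)")
```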

Reproduction

Analysis started: 2021-10-18 18:12:09.240134
Analysis finished: 2021-10-18 18:12:27.679530
Duration: 18.44 seconds
Software version: pandas-profiling v3.0.0
Download configuration: config.json

Variables

pregnancies
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 17
Distinct (%): 0.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.7035
Minimum: 0
Maximum: 17
Zeros: 301
Zeros (%): 15.0%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 1
Median: 3
Q3: 6
95-th percentile: 10
Maximum: 17
Range: 17
Interquartile range (IQR): 5
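
The range and IQR follow directly from the other quantiles: Range = Maximum - Minimum and IQR = Q3 - Q1. As a rough illustration, the sketch below uses a hypothetical 9-value sample (chosen so its quartiles match the table, not drawn from the real column) and Python's `statistics.quantiles` with linear interpolation; whether the report uses the same interpolation rule is an assumption.

```python
from statistics import quantiles

# Hypothetical sample; NOT taken from the real `pregnancies` column.
sample = [0, 0, 1, 2, 3, 5, 6, 10, 17]

# method="inclusive" interpolates linearly between order statistics and
# treats the sample minimum and maximum as the 0th and 100th percentiles.
q1, median, q3 = quantiles(sample, n=4, method="inclusive")

value_range = max(sample) - min(sample)  # Range = Maximum - Minimum
iqr = q3 - q1                            # IQR = Q3 - Q1

print(q1, median, q3, value_range, iqr)
```

For this sample the quartiles are 1, 3 and 6, giving a range of 17 and an IQR of 5, the same values the report lists.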

Descriptive statistics

Standard deviation: 3.306063033
Coefficient of variation (CV): 0.8926861166
Kurtosis: 0.409867576
Mean: 3.7035
Median Absolute Deviation (MAD): 2
Skewness: 0.9823655943
Sum: 7407
Variance: 10.93005278
Monotonicity: Not monotonic
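
Several of these descriptive statistics are tied together by simple identities: the mean is the sum divided by the number of observations, the coefficient of variation is the standard deviation divided by the mean, and the variance is the square of the standard deviation. A short check using only values reported for `pregnancies` (variable names are illustrative):

```python
# Values copied from the `pregnancies` tables in this report.
n_observations = 2000
total = 7407             # Sum
std_dev = 3.306063033    # Standard deviation

mean = total / n_observations   # Mean = Sum / N
cv = std_dev / mean             # CV = standard deviation / mean
variance = std_dev ** 2         # Variance = standard deviation squared

print(f"mean={mean:.4f}  cv={cv:.10f}  variance={variance:.8f}")
```

The results agree with the reported mean (3.7035), CV (0.8926861166) and variance (10.93005278) up to rounding.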
Histogram with fixed size bins (bins=17)
Value               Count   Frequency (%)
1                   356     17.8%
0                   301     15.0%
2                   284     14.2%
3                   195     9.8%
4                   191     9.6%
5                   141     7.0%
6                   131     6.6%
7                   100     5.0%
8                   96      4.8%
9                   70      3.5%
Other values (7)    135     6.8%

Value               Count   Frequency (%)
0                   301     15.0%
1                   356     17.8%
2                   284     14.2%
3                   195     9.8%
4                   191     9.6%
5                   141     7.0%
6                   131     6.6%
7                   100     5.0%
8                   96      4.8%
9                   70      3.5%

Value               Count   Frequency (%)
17                  3       0.1%
15                  2       0.1%
14                  7       0.4%
13                  22      1.1%
12                  23      1.1%
11                  24      1.2%
10                  54      2.7%
9                   70      3.5%
8                   96      4.8%
7                   100     5.0%

glucose
Real number (ℝ≥0)

Distinct: 136
Distinct (%): 6.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 121.1825
Minimum: 0
Maximum: 199
Zeros: 13
Zeros (%): 0.7%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 80
Q1: 99
Median: 117
Q3: 141
95-th percentile: 181
Maximum: 199
Range: 199
Interquartile range (IQR): 42

Descriptive statistics

Standard deviation: 32.06863565
Coefficient of variation (CV): 0.2646309133
Kurtosis: 0.5603705831
Mean: 121.1825
Median Absolute Deviation (MAD): 20
Skewness: 0.1588058725
Sum: 242365
Variance: 1028.397392
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                 Count   Frequency (%)
99                    49      2.5%
100                   44      2.2%
102                   39      1.9%
129                   37      1.8%
95                    36      1.8%
112                   36      1.8%
106                   36      1.8%
105                   34      1.7%
120                   33      1.7%
108                   33      1.7%
Other values (126)    1623    81.2%

Value                 Count   Frequency (%)
0                     13      0.7%
44                    2       0.1%
56                    3       0.1%
57                    5       0.2%
61                    3       0.1%
62                    2       0.1%
65                    3       0.1%
67                    2       0.1%
68                    7       0.4%
71                    9       0.4%

Value                 Count   Frequency (%)
199                   3       0.1%
198                   2       0.1%
197                   8       0.4%
196                   5       0.2%
195                   8       0.4%
194                   10      0.5%
193                   6       0.3%
191                   2       0.1%
190                   3       0.1%
189                   8       0.4%

blood_pressure
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 47
Distinct (%): 2.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 69.1455
Minimum: 0
Maximum: 122
Zeros: 90
Zeros (%): 4.5%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 43.8
Q1: 63.5
Median: 72
Q3: 80
95-th percentile: 90
Maximum: 122
Range: 122
Interquartile range (IQR): 16.5

Descriptive statistics

Standard deviation: 19.18831482
Coefficient of variation (CV): 0.2775063426
Kurtosis: 5.32848981
Mean: 69.1455
Median Absolute Deviation (MAD): 8
Skewness: -1.854476017
Sum: 138291
Variance: 368.1914255
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=47)
Value                Count   Frequency (%)
74                   145     7.2%
70                   144     7.2%
78                   128     6.4%
68                   125     6.2%
64                   120     6.0%
72                   118     5.9%
80                   98      4.9%
62                   94      4.7%
76                   93      4.7%
60                   92      4.6%
Other values (37)    843     42.1%

Value                Count   Frequency (%)
0                    90      4.5%
24                   2       0.1%
30                   3       0.1%
38                   3       0.1%
40                   2       0.1%
44                   11      0.5%
46                   6       0.3%
48                   13      0.7%
50                   31      1.6%
52                   29      1.5%

Value                Count   Frequency (%)
122                  3       0.1%
114                  3       0.1%
110                  7       0.4%
108                  5       0.2%
106                  9       0.4%
104                  5       0.2%
102                  3       0.1%
100                  9       0.4%
98                   8       0.4%
96                   8       0.4%

skin_thickness
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 53
Distinct (%): 2.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 20.935
Minimum: 0
Maximum: 110
Zeros: 573
Zeros (%): 28.6%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 23
Q3: 32
95-th percentile: 44.05
Maximum: 110
Range: 110
Interquartile range (IQR): 32

Descriptive statistics

Standard deviation: 16.10324291
Coefficient of variation (CV): 0.7692019541
Kurtosis: 0.1555797786
Mean: 20.935
Median Absolute Deviation (MAD): 12
Skewness: 0.2072281256
Sum: 41870
Variance: 259.3144322
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                Count   Frequency (%)
0                    573     28.6%
32                   83      4.2%
30                   75      3.8%
23                   60      3.0%
27                   58      2.9%
18                   54      2.7%
28                   54      2.7%
39                   52      2.6%
33                   51      2.5%
31                   50      2.5%
Other values (43)    890     44.5%

Value                Count   Frequency (%)
0                    573     28.6%
7                    3       0.1%
8                    6       0.3%
10                   13      0.7%
11                   14      0.7%
12                   21      1.1%
13                   30      1.5%
14                   15      0.8%
15                   33      1.7%
16                   15      0.8%

Value                Count   Frequency (%)
110                  2       0.1%
99                   2       0.1%
63                   3       0.1%
60                   2       0.1%
59                   2       0.1%
56                   3       0.1%
54                   4       0.2%
52                   4       0.2%
51                   3       0.1%
50                   7       0.4%

insulin
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 182
Distinct (%): 9.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 80.254
Minimum: 0
Maximum: 744
Zeros: 956
Zeros (%): 47.8%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 40
Q3: 130
95-th percentile: 293
Maximum: 744
Range: 744
Interquartile range (IQR): 130

Descriptive statistics

Standard deviation: 111.1805335
Coefficient of variation (CV): 1.385358157
Kurtosis: 5.128261644
Mean: 80.254
Median Absolute Deviation (MAD): 40
Skewness: 1.996084356
Sum: 160508
Variance: 12361.11104
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                 Count   Frequency (%)
0                     956     47.8%
105                   31      1.6%
140                   24      1.2%
180                   23      1.1%
130                   22      1.1%
120                   21      1.1%
100                   20      1.0%
94                    17      0.9%
135                   17      0.9%
76                    17      0.9%
Other values (172)    852     42.6%

Value                 Count   Frequency (%)
0                     956     47.8%
14                    3       0.1%
15                    3       0.1%
16                    3       0.1%
18                    5       0.2%
22                    3       0.1%
23                    4       0.2%
25                    2       0.1%
29                    3       0.1%
32                    2       0.1%

Value                 Count   Frequency (%)
744                   2       0.1%
680                   2       0.1%
600                   2       0.1%
579                   4       0.2%
545                   2       0.1%
540                   3       0.1%
510                   3       0.1%
495                   5       0.2%
485                   3       0.1%
480                   7       0.4%

bmi
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 247
Distinct (%): 12.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 32.193
Minimum: 0
Maximum: 80.6
Zeros: 28
Zeros (%): 1.4%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 21.8
Q1: 27.375
Median: 32.3
Q3: 36.8
95-th percentile: 45.01
Maximum: 80.6
Range: 80.6
Interquartile range (IQR): 9.425

Descriptive statistics

Standard deviation: 8.149900701
Coefficient of variation (CV): 0.2531575405
Kurtosis: 4.131722134
Mean: 32.193
Median Absolute Deviation (MAD): 4.7
Skewness: -0.09045533681
Sum: 64386
Variance: 66.42088144
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                 Count   Frequency (%)
31.2                  33      1.7%
32                    33      1.7%
31.6                  29      1.5%
0                     28      1.4%
33.3                  27      1.4%
32.8                  25      1.2%
32.4                  25      1.2%
32.9                  24      1.2%
30.8                  24      1.2%
30.1                  22      1.1%
Other values (237)    1730    86.5%

Value                 Count   Frequency (%)
0                     28      1.4%
18.2                  8       0.4%
18.4                  2       0.1%
19.1                  2       0.1%
19.3                  3       0.1%
19.4                  2       0.1%
19.5                  6       0.3%
19.6                  6       0.3%
20                    3       0.1%
20.1                  5       0.2%

Value                 Count   Frequency (%)
80.6                  2       0.1%
67.1                  3       0.1%
64.4                  2       0.1%
59.4                  3       0.1%
57.3                  3       0.1%
55                    3       0.1%
53.2                  3       0.1%
52.9                  3       0.1%
52.7                  2       0.1%
52.3                  4       0.2%

diabetes_pedigree_function
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 505
Distinct (%): 25.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.47093
Minimum: 0.078
Maximum: 2.42
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 0.078
5-th percentile: 0.141
Q1: 0.244
Median: 0.376
Q3: 0.624
95-th percentile: 1.136
Maximum: 2.42
Range: 2.342
Interquartile range (IQR): 0.38

Descriptive statistics

Standard deviation: 0.3235525587
Coefficient of variation (CV): 0.687050217
Kurtosis: 5.006839839
Mean: 0.47093
Median Absolute Deviation (MAD): 0.168
Skewness: 1.811978894
Sum: 941.86
Variance: 0.1046862582
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                 Count   Frequency (%)
0.258                 16      0.8%
0.207                 15      0.8%
0.268                 13      0.7%
0.292                 13      0.7%
0.261                 13      0.7%
0.238                 13      0.7%
0.52                  13      0.7%
0.284                 12      0.6%
0.551                 12      0.6%
0.259                 12      0.6%
Other values (495)    1868    93.4%

Value                 Count   Frequency (%)
0.078                 2       0.1%
0.084                 2       0.1%
0.085                 5       0.2%
0.088                 6       0.3%
0.089                 2       0.1%
0.092                 2       0.1%
0.096                 3       0.1%
0.1                   3       0.1%
0.101                 2       0.1%
0.102                 2       0.1%

Value                 Count   Frequency (%)
2.42                  3       0.1%
2.329                 2       0.1%
2.137                 3       0.1%
1.893                 2       0.1%
1.781                 2       0.1%
1.731                 3       0.1%
1.699                 2       0.1%
1.698                 3       0.1%
1.6                   3       0.1%
1.476                 2       0.1%

age
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 52
Distinct (%): 2.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 33.0905
Minimum: 21
Maximum: 81
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 15.8 KiB

Quantile statistics

Minimum: 21
5-th percentile: 21
Q1: 24
Median: 29
Q3: 40
95-th percentile: 58
Maximum: 81
Range: 60
Interquartile range (IQR): 16

Descriptive statistics

Standard deviation: 11.78642311
Coefficient of variation (CV): 0.3561875193
Kurtosis: 0.8263829494
Mean: 33.0905
Median Absolute Deviation (MAD): 7
Skewness: 1.181267223
Sum: 66181
Variance: 138.9197696
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                Count   Frequency (%)
22                   192     9.6%
21                   166     8.3%
25                   134     6.7%
24                   122     6.1%
23                   103     5.1%
28                   98      4.9%
26                   84      4.2%
27                   81      4.0%
29                   70      3.5%
31                   58      2.9%
Other values (42)    892     44.6%

Value                Count   Frequency (%)
21                   166     8.3%
22                   192     9.6%
23                   103     5.1%
24                   122     6.1%
25                   134     6.7%
26                   84      4.2%
27                   81      4.0%
28                   98      4.9%
29                   70      3.5%
30                   56      2.8%

Value                Count   Frequency (%)
81                   3       0.1%
72                   3       0.1%
70                   3       0.1%
69                   6       0.3%
68                   3       0.1%
67                   10      0.5%
66                   12      0.6%
65                   8       0.4%
64                   3       0.1%
63                   13      0.7%

outcome
Categorical

Distinct: 2
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 15.8 KiB

Category counts
0: 1316
1: 684

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 2000
Distinct characters: 2
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 1
2nd row: 0
3rd row: 1
4th row: 1
5th row: 0

Common Values

Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Length

Histogram of lengths of the category

Pie chart

Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Most occurring characters

Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Most occurring categories

Value             Count   Frequency (%)
Decimal Number    2000    100.0%

Most frequent character per category

Decimal Number
Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Most occurring scripts

Value             Count   Frequency (%)
Common            2000    100.0%

Most frequent character per script

Common
Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Most occurring blocks

Value             Count   Frequency (%)
ASCII             2000    100.0%

Most frequent character per block

ASCII
Value             Count   Frequency (%)
0                 1316    65.8%
1                 684     34.2%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
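
The calculation described above can be sketched in a few lines of plain Python. The function name `pearson_r` and the sample values are illustrative assumptions, not part of the report or of pandas-profiling.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (n - 1 in the denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical values, not taken from the dataset.
ages = [21, 25, 30, 40, 58]
pregnancies = [0, 1, 3, 6, 10]

r = pearson_r(ages, pregnancies)
# Shifting and rescaling one variable leaves r unchanged.
r_rescaled = pearson_r([2 * a + 3 for a in ages], pregnancies)
print(r, r_rescaled)
```

The second call illustrates the invariance under changes in location and scale: both calls return the same value up to floating-point error.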

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic correlations. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
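
A minimal sketch of that recipe: replace each value by its rank (ties share the average rank, an assumed but common convention), then apply the covariance-over-standard-deviations formula to the ranks. The helper names and data are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def spearman_rho(xs, ys):
    # Pearson's formula applied to the rank variables.
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical monotonic but nonlinear relationship: y = x cubed.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```

On this example ρ reaches 1 (up to floating-point error) because the relationship is perfectly monotonic, while Pearson's r stays below 1 because it is not linear.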

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
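
The pair-counting definition above corresponds to the variant usually called tau-a. The sketch below implements exactly that definition; statistical libraries often default to a tie-corrected variant (tau-b), so values may differ when the data contain ties. The function name and data are illustrative.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite ways
        # direction == 0 means a tie, counted as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: one swapped pair out of six.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

Here five of the six pairs are concordant and one is discordant, so τ = (5 - 1) / 6.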

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.

Sample

First rows

      pregnancies  glucose  blood_pressure  skin_thickness  insulin  bmi   diabetes_pedigree_function  age  outcome
0     2            138      62              35              0        33.6  0.127                       47   1
1     0            84       82              31              125      38.2  0.233                       23   0
2     0            145      0               0               0        44.2  0.630                       31   1
3     0            135      68              42              250      42.3  0.365                       24   1
4     1            139      62              41              480      40.7  0.536                       21   0
5     0            173      78              32              265      46.5  1.159                       58   0
6     4            99       72              17              0        25.6  0.294                       28   0
7     8            194      80              0               0        26.1  0.551                       67   0
8     2            83       65              28              66       36.8  0.629                       24   0
9     2            89       90              30              0        33.5  0.292                       42   0

Last rows

      pregnancies  glucose  blood_pressure  skin_thickness  insulin  bmi   diabetes_pedigree_function  age  outcome
1990  3            111      90              12              78       28.4  0.495                       29   0
1991  6            102      82              0               0        30.8  0.180                       36   1
1992  6            134      70              23              130      35.4  0.542                       29   1
1993  2            87       0               23              0        28.9  0.773                       25   0
1994  1            79       60              42              48       43.5  0.678                       23   0
1995  2            75       64              24              55       29.7  0.370                       33   0
1996  8            179      72              42              130      32.7  0.719                       36   1
1997  6            85       78              0               0        31.2  0.382                       42   0
1998  0            129      110             46              130      67.1  0.319                       26   1
1999  2            81       72              15              76       30.1  0.547                       25   0

Duplicate rows

Most frequently occurring

      pregnancies  glucose  blood_pressure  skin_thickness  insulin  bmi   diabetes_pedigree_function  age  outcome  # duplicates
245   2            81       72              15              76       30.1  0.547                       25   0        6
99    0            173      78              32              265      46.5  1.159                       58   0        5
247   2            83       65              28              66       36.8  0.629                       24   0        5
256   2            89       90              30              0        33.5  0.292                       42   0        5
345   3            80       0               0               0        0.0   0.174                       22   0        5
424   4            99       68              38              0        32.8  0.145                       33   0        5
425   4            99       72              17              0        25.6  0.294                       28   0        5
444   4            125      70              18              122      28.9  1.144                       45   1        5
498   5            110      68              0               0        26.0  0.292                       30   0        5
569   6            154      74              32              193      29.3  0.839                       39   0        5